41  Standardisation and Normalisation

In this tutorial, we’ll explore how to perform data standardisation and normalisation in R. These preprocessing steps are critical for many statistical analyses and machine learning models, as they can significantly impact performance and results.

41.1 What are Standardisation and Normalisation?

  • Standardisation (also known as Z-score normalisation) is the process of rescaling the features so they have the properties of a standard normal distribution with a mean of 0 and a standard deviation of 1.

  • Normalisation typically means rescaling the values into a range of [0, 1].
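Both transformations can be written as one-liners in R; here is a minimal sketch on a small hypothetical vector `x`:

```r
x <- c(2, 4, 6, 8, 10)

# Standardisation (z-score): subtract the mean, divide by the standard deviation
z <- (x - mean(x)) / sd(x)
round(mean(z), 10) # 0
sd(z)              # 1

# Normalisation (min-max): rescale into [0, 1]
n <- (x - min(x)) / (max(x) - min(x))
range(n)           # 0 1
```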

41.2 When are they used?

You would standardise your data when you need to compare features measured in different units or on different scales. This is particularly important for models that assume normally distributed data, such as linear regression, and for techniques that are sensitive to variance, like principal component analysis (PCA).

Standardisation is useful because it transforms your data to have a mean of 0 and a standard deviation of 1, ensuring that each feature contributes equally to the analysis.
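To see why this matters for PCA, the sketch below (using simulated data, not the tutorial's dataset) compares the variance explained by the first principal component with and without standardisation, via the `scale.` argument of `prcomp()`:

```r
set.seed(1)
df <- data.frame(
  small = rnorm(100, sd = 1),   # small-variance feature
  large = rnorm(100, sd = 100)  # large-variance feature
)

# Without scaling, the large-variance feature dominates the first component
pca_raw <- prcomp(df, scale. = FALSE)
summary(pca_raw)$importance["Proportion of Variance", ]

# With scaling (standardisation), both features contribute comparably
pca_std <- prcomp(df, scale. = TRUE)
summary(pca_std)$importance["Proportion of Variance", ]
```

Without scaling, PC1 captures nearly all of the variance simply because one feature has a much larger variance; after scaling, the two features contribute roughly equally.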

Normalisation, on the other hand, is used when you need to bound your data within a specific range, such as [0, 1]. This is often required by algorithms that assume bounded inputs, like neural networks, and by models that rely on distance calculations, such as k-nearest neighbours (KNN).

Normalisation helps in speeding up the convergence of gradient descent algorithms by ensuring all parameters are on a similar scale.

We’ll return to normalisation when covering machine learning later in the module.

41.3 Example

We’ll start by creating two vectors X and Y. X has a normal distribution, and Y has a uniform distribution.

Show code
rm(list=ls())

set.seed(123) # Ensure reproducibility

# Generate synthetic data
data <- data.frame(
  X = rnorm(100, mean = 50, sd = 10), # Normally distributed data
  Y = runif(100, min = 200, max = 400) # Uniformly distributed data
)

# Original Data
hist(data$X, main = "Original X", xlab = "X")

Show code
hist(data$Y, main = "Original Y", xlab = "Y")

We can standardise the data using the scale() function. This gives each variable a mean of 0 and a standard deviation of 1.

Show code for standardisation
library(psych)

# Standardise data
data_standardised <- as.data.frame(scale(data))

# Summary to verify standardisation
describe(data_standardised) # using the psych library
  vars   n mean sd median trimmed  mad   min  max range skew kurtosis  se
X    1 100    0  1  -0.03   -0.01 0.97 -2.63 2.30  4.93 0.06    -0.22 0.1
Y    2 100    0  1  -0.04   -0.01 1.27 -1.63 1.69  3.32 0.08    -1.28 0.1

Notice that both variables now have a mean of 0 and an SD of 1.
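As a sanity check, scale() applied to a single vector is equivalent to subtracting the mean and dividing by the standard deviation. The sketch below verifies this on a fresh simulated vector:

```r
set.seed(123)
x <- rnorm(100, mean = 50, sd = 10)

# scale() is shorthand for (x - mean(x)) / sd(x)
manual <- (x - mean(x)) / sd(x)
via_scale <- as.numeric(scale(x)) # as.numeric() drops the matrix attributes

all.equal(manual, via_scale) # TRUE
```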

Show code for plotting
# Plotting
par(mfrow = c(2, 2))

# Original Data
hist(data$X, main = "X", xlab = "X")
hist(data$Y, main = "Y", xlab = "Y")

# Standardised Data
hist(data_standardised$X, main = "Standardised X", xlab = "X")
hist(data_standardised$Y, main = "Standardised Y", xlab = "Y")

We can also normalise the original data. This scales each variable to a range between 0 and 1.

Show code for normalisation
# Min-max normalisation function
normalise <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Apply normalisation
data_normalised <- as.data.frame(lapply(data, normalise))

# Summary to verify normalisation
describe(data_normalised)
  vars   n mean  sd median trimmed  mad min max range skew kurtosis   se
X    1 100 0.53 0.2   0.53    0.53 0.20   0   1     1 0.06    -0.22 0.02
Y    2 100 0.49 0.3   0.48    0.49 0.38   0   1     1 0.08    -1.28 0.03
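One practical point: min-max normalisation is easily inverted, provided you keep the original minimum and maximum. The sketch below uses a hypothetical denormalise() helper (not part of the tutorial's code) to recover the original values:

```r
normalise <- function(x) (x - min(x)) / (max(x) - min(x))

# Hypothetical inverse: map [0, 1] values back onto the original scale
denormalise <- function(x_scaled, orig_min, orig_max) {
  x_scaled * (orig_max - orig_min) + orig_min
}

y <- c(210, 250, 300, 390)
y_scaled <- normalise(y)
y_back <- denormalise(y_scaled, min(y), max(y))

all.equal(y, y_back) # TRUE
```

Keeping the training-set minimum and maximum is also what you would need to apply the same scaling to new data, though values outside the original range will then fall outside [0, 1].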
Show code for plotting
# Plotting
par(mfrow = c(2, 2))

# Original Data
hist(data$X, main = "Original X", xlab = "X")
hist(data$Y, main = "Original Y", xlab = "Y")


# Normalised Data
hist(data_normalised$X, main = "Normalised X", xlab = "X")
hist(data_normalised$Y, main = "Normalised Y", xlab = "Y")

Further visualisations:

Show code for visualisations
library(ggplot2)
library(reshape2)

# Add a 'Type' column to each dataset
data$Type <- 'Original'
data_standardised$Type <- 'Standardised'
data_normalised$Type <- 'Normalised'

# Combine the datasets
combined_data <- rbind(data, data_standardised, data_normalised)

# Melt the combined data for ggplot2
data_melted <- melt(combined_data, id.vars = 'Type', variable.name = 'Vector', value.name = 'Value')


# Distribution plots
ggplot(data_melted, aes(x = Value, fill = Type)) + 
  geom_histogram(alpha = 0.6, position = "identity", bins = 30) + 
  facet_wrap(~Vector, scales = 'free') + 
  theme_minimal() + 
  scale_fill_brewer(palette = "Set1") + 
  labs(title = "Distribution of Vectors by Type", x = "Value", y = "Count")

Show code for visualisations
# Box plots
ggplot(data_melted, aes(x = Vector, y = Value, color = Type)) + 
  geom_boxplot() + 
  facet_wrap(~Type, scales = 'free') + 
  theme_minimal() + 
  scale_color_brewer(palette = "Set2") + 
  labs(title = "Box Plot of Vectors by Preprocessing Type", x = "Vector", y = "Value")